
data(sources): enrich candidate source list from ChatGPT survey 3 #35

Merged: shaypal5 merged 1 commit into main from docs/survey-3-source-enrichment on May 24, 2026

@shaypal5

Contributor

What

Adds 3 new candidate source records and enriches 10 existing candidates with
specific sub-collection URLs, scale data, and access notes, based on a
prioritised commercial-use survey of Hebrew manuscript repositories.

New sources

| source_id | What |
| --- | --- |
| `commons__hebrew_language_manuscripts` | Wikimedia Commons parent category: 17 subcats + ~105 direct files (Cairo Geniza, Bible MSS, illuminated MSS, Wellcome, Damascus Pentateuch) |
| `commons__hebrew_calligraphy` | Wikimedia Commons: ~74 files + subcats; illuminated MSS and ketubot |
| `openn__judaica_collection_index` | OPenn Judaica umbrella index; covers Gaster Hebrew MSS and other sub-collections not yet individually tracked |

Enriched sources

All existing OPenn candidates now have specific sub-collection landing URLs instead of the generic openn.library.upenn.edu/:

| source_id | Key change |
| --- | --- |
| `openn__bl_hebrew_manuscripts` | Landing URL (collection 0032); scale confirmed at ~435,000 images |
| `openn__cairo_genizah_fragments` | Landing URL (genizah_contents.html) |
| `openn__manchester_hebrew_manuscripts` | Landing URL (0021.html) + critical caveat: Manchester's own viewer is CC BY-NC; use the OPenn-hosted copy (CC BY 4.0) |
| `openn__katz_center_judaica` | Landing URL (0002.html) |
| `openn__zucker_ketubah_collection` | Landing URL (0051.html) |
| `leipzig__hebrew_manuscripts` | URL corrected from manuscripta-mediaevalia.de to Leipzig's own page (which holds the PD rights statement) |
| `nypl__hebrew_manuscripts_digital_collections` | Results count added (1,174) |
| `mdz__hebrew_manuscripts` | Landing URL + scale (~700 pieces incl. 183 fragments, 12th–18th c.) |
| `archive__hebrew_manuscripts` | Named high-value items: Leningrad Codex, Aleppo Codex, Cervera Bible, Lailashi Codex, Haverford Masoretic Bible |
| `huggingface__sivan22_hebrew_handwritten` | Licence confirmed as CC BY 3.0; 5,093 rows / 28 classes; added policy-review note (CC-BY-3.0 not yet in the AGENTS.md accepted list) |
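
Two of the rules implied by the table above (a specific landing URL rather than the bare OPenn root, and a policy-review note for licences outside the accepted list) could be checked mechanically. The following is only a minimal sketch: the `Candidate` fields, the flag names, and the contents of `ACCEPTED_LICENCES` are hypothetical placeholders, since neither the repository's record schema nor the actual AGENTS.md list is shown in this PR.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical stand-in for the accepted-licence list in AGENTS.md.
# The real list is not shown in this PR; these values are illustrative only.
ACCEPTED_LICENCES = {"CC0-1.0", "CC-BY-4.0", "PDM-1.0"}


@dataclass
class Candidate:
    """Hypothetical shape of a candidate source record."""
    source_id: str
    licence: str
    landing_url: str


def review_flags(candidate: Candidate) -> list[str]:
    """Return review flags for a candidate; an empty list means no flags."""
    flags = []
    # Licences outside the accepted set need a human policy decision.
    if candidate.licence not in ACCEPTED_LICENCES:
        flags.append("licence-needs-policy-review")
    # A URL with no path component points at a site root, not a sub-collection.
    if urlparse(candidate.landing_url).path in ("", "/"):
        flags.append("generic-landing-url")
    return flags
```

Under these assumptions, a CC-BY-3.0 record would be flagged for policy review rather than rejected outright, which mirrors the note added to `huggingface__sivan22_hebrew_handwritten`.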

Docs

  • docs/sources/chatgpt_summary_3.md — full survey with prioritised ingestion ordering

Validation

ok: 93 sources, 345 entries, 345 files verified, recipe ok
80 passed
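
For anyone scripting against the validator output, the summary line can be parsed into counts. This is a rough sketch that assumes the line format shown above is stable; the function names are hypothetical, and the expectation that the entry count equals the verified-file count is an assumption drawn from this one run (345 and 345), not a documented guarantee of the validator.

```python
import re

# Pattern for the summary line as it appears in this PR; the format is assumed stable.
SUMMARY_RE = re.compile(r"ok: (\d+) sources, (\d+) entries, (\d+) files verified")


def parse_summary(line: str) -> dict[str, int]:
    """Extract the three counts from a validator summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised summary line: {line!r}")
    sources, entries, files_verified = (int(g) for g in match.groups())
    return {"sources": sources, "entries": entries, "files_verified": files_verified}


def entries_match_files(summary: dict[str, int]) -> bool:
    """Assumed invariant: every entry has exactly one verified file."""
    return summary["entries"] == summary["files_verified"]
```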

🤖 Generated with Claude Code

Adds 3 new candidate sources and enriches 10 existing candidates with
specific URLs, scale data, and access notes surfaced by a prioritised
commercial-use source survey (chatgpt_summary_3.md).

New sources
───────────
- commons__hebrew_language_manuscripts — Wikimedia Commons parent
  category (17 subcats + ~105 files: Cairo Geniza, Bible MSS,
  illuminated MSS, Wellcome, Damascus Pentateuch)
- commons__hebrew_calligraphy — Wikimedia Commons (~74 files + subcats;
  illuminated MSS and ketubot)
- openn__judaica_collection_index — OPenn Judaica umbrella index
  (openn.library.upenn.edu/html/judaica_contents.html); covers Gaster
  Hebrew MSS and other sub-collections not yet individually tracked

Enriched sources
────────────────
- openn__bl_hebrew_manuscripts: landing URL (collection 0032), scale
  confirmed ~435,000 images, Polonsky Foundation ref added
- openn__cairo_genizah_fragments: landing URL (genizah_contents.html)
- openn__manchester_hebrew_manuscripts: landing URL (0021.html) + critical
  caveat — Manchester own viewer is CC BY-NC; use OPenn copy (CC BY 4.0)
- openn__katz_center_judaica: landing URL (0002.html)
- openn__zucker_ketubah_collection: landing URL (0051.html)
- leipzig__hebrew_manuscripts: corrected URL to Leipzig direct page
- nypl__hebrew_manuscripts_digital_collections: 1,174 results count added
- mdz__hebrew_manuscripts: landing URL + scale (~700 pcs incl. 183 fragments)
- archive__hebrew_manuscripts: named high-value items (Leningrad Codex,
  Aleppo Codex, Cervera Bible, Lailashi Codex, Haverford Masoretic Bible)
- huggingface__sivan22_hebrew_handwritten: CC BY 3.0 licence detail,
  5,093 rows / 28 classes, added policy-review note (CC-BY-3.0 not
  explicitly in AGENTS.md accepted list)

Validation: ok: 93 sources, 345 entries, 345 files verified, recipe ok
Tests: 80 passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shaypal5 added the following labels on May 24, 2026: enhancement (New feature or request), area:data (Dataset rows or scan files), area:docs (README, AGENTS.md, or docs/*), size:M (Medium PR: multi-file, bounded scope)
@shaypal5 merged commit ad358ed into main on May 24, 2026
1 check passed
@shaypal5 deleted the docs/survey-3-source-enrichment branch on May 24, 2026, 19:16